{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1f8ffd0e",
   "metadata": {},
   "source": [
    "# End-to-end Semantic Deduplication on Text Data\n",
    "\n",
    "NeMo Curator provides a GPU-accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540). For more information about semantic deduplication in NeMo Curator, refer to the [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) documentation page.\n",
    "\n",
    "This tutorial shows how to run semantic deduplication on text data by executing a single workflow that does the following:\n",
    "\n",
    "1. Read the original dataset\n",
    "2. Run embedding generation\n",
    "3. Use K-Means to cluster the embeddings\n",
    "4. Compute pairwise similarity inside each of the clusters\n",
    "5. Identify duplicates based on the provided `eps` (and `ranking_strategy`)\n",
    "6. Remove duplicates from the original dataset\n",
    "\n",
    "These steps can also be run independently, as shown in the step-by-step tutorial in the same directory as this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58c0cc09",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Silence Curator logs via Loguru\n",
    "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\"\n",
    "\n",
    "import pandas as pd\n",
    "import pyarrow.parquet as pq\n",
    "\n",
    "input_path = os.path.abspath(\"./input\")\n",
    "semantic_out_dir = os.path.abspath(\"./output/e2e\")\n",
    "output_path = os.path.join(semantic_out_dir, \"output\")\n",
    "cache_path = os.path.join(semantic_out_dir, \"cache\")\n",
    "# Either \"jsonl\" or \"parquet\"; if you change this, also change how the input data is generated below\n",
    "input_filetype = \"parquet\"\n",
    "output_filetype = \"parquet\"  # either \"jsonl\" or \"parquet\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca4fbbbc",
   "metadata": {},
   "source": [
    "## Generate Input Data\n",
    "\n",
    "If the input path above contains no files, we generate the input data:\n",
    " - We load the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (just the train partition), which has 2,119,719 rows\n",
    " - We split it into shards such that no shard has more than 10,000 rows\n",
    " - We add a new `id` column containing a UUID for each row\n",
    " - We write out 212 Parquet files (2,119,719 / 10,000, rounded up)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7f43245f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_curator.utils.file_utils import get_all_file_paths_under\n",
    "\n",
    "if len(get_all_file_paths_under(input_path)) == 0:\n",
    "    import uuid\n",
    "\n",
    "    from datasets import load_dataset\n",
    "\n",
    "    # Download the TinyStories train split and load it into a pandas DataFrame\n",
    "    input_df = load_dataset(\"roneneldan/TinyStories\", split=\"train\").to_pandas()\n",
    "    num_rows_per_file = 10_000  # maximum number of rows per shard\n",
    "\n",
    "    os.makedirs(input_path, exist_ok=True)\n",
    "\n",
    "    for i, start_idx in enumerate(range(0, len(input_df), num_rows_per_file)):\n",
    "        end_idx = min(len(input_df), start_idx + num_rows_per_file)\n",
    "        subset_df = input_df.iloc[start_idx:end_idx].copy()\n",
    "        subset_df[\"id\"] = [str(uuid.uuid4()) for _ in range(len(subset_df))]\n",
    "        subset_df.to_parquet(os.path.join(input_path, f\"part_{i}.parquet\"), index=False)\n",
    "\n",
    "    print(f\"Created {len(os.listdir(input_path))} files\")"
   ]
  },
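  {
   "cell_type": "markdown",
   "id": "3b7e1c52",
   "metadata": {},
   "source": [
    "As a quick sanity check on the sharding described above, the next cell is a minimal sketch of the arithmetic. It assumes the row count quoted earlier (2,119,719) and the 10,000-row limit; `total_rows`, `rows_per_file`, and `shard_sizes` are illustrative names that are not part of the tutorial's pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d2a9f04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "total_rows = 2_119_719  # assumed: TinyStories train split size quoted above\n",
    "rows_per_file = 10_000  # same limit as num_rows_per_file in the previous cell\n",
    "\n",
    "# Number of shards needed so that no shard exceeds rows_per_file\n",
    "num_files = math.ceil(total_rows / rows_per_file)\n",
    "\n",
    "# Size of every shard; only the last one is smaller than the limit\n",
    "shard_sizes = [min(rows_per_file, total_rows - start) for start in range(0, total_rows, rows_per_file)]\n",
    "\n",
    "print(num_files, shard_sizes[-1])"
   ]
  },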
  {
   "cell_type": "markdown",
   "id": "9fd4cadb",
   "metadata": {},
   "source": [
    "## Running as a Single Stage (End-to-End)\n",
    "\n",
    "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.deduplication.semantic.html#stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow) for more information about the `TextSemanticDeduplicationWorkflow` class.\n",
    "\n",
    "### Performance Notes\n",
    "Set `use_id_generator=True` if you want to remove duplicates from large datasets (i.e. when `perform_removal=True`).\n",
    "\n",
    "- The ID Generator gives each row a unique, increasing integer ID, based on the order in which files are read.\n",
    "- When we find duplicates, we save these integer IDs in sorted files with multiple row groups.\n",
    "- During removal, reading the same files gives the same integer IDs. Using the min/max ID values, we can then find all corresponding duplicates.\n",
    "- This makes finding and removing duplicates much faster."
   ]
  },
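  {
   "cell_type": "markdown",
   "id": "8c41d7a9",
   "metadata": {},
   "source": [
    "To make the min/max idea above concrete, the next cell is a minimal pure-Python sketch. It is not NeMo Curator's implementation: row groups are simulated as plain sorted lists, and `make_row_groups` and `find_removals` are hypothetical helpers introduced only for illustration. The point is that sorted IDs let us skip every group whose min/max range cannot overlap the IDs of the batch being read."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f0b236",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bisect import bisect_left, bisect_right\n",
    "\n",
    "\n",
    "def make_row_groups(sorted_ids, group_size):\n",
    "    \"\"\"Split sorted duplicate IDs into fixed-size groups, mimicking row groups in a sorted file.\"\"\"\n",
    "    return [sorted_ids[i : i + group_size] for i in range(0, len(sorted_ids), group_size)]\n",
    "\n",
    "\n",
    "def find_removals(row_groups, batch_min, batch_max):\n",
    "    \"\"\"Return the duplicate IDs in [batch_min, batch_max], skipping groups that cannot overlap.\"\"\"\n",
    "    removals = []\n",
    "    for group in row_groups:\n",
    "        # The first/last element of a sorted group plays the role of its min/max statistics\n",
    "        if group[-1] < batch_min or group[0] > batch_max:\n",
    "            continue\n",
    "        lo = bisect_left(group, batch_min)\n",
    "        hi = bisect_right(group, batch_max)\n",
    "        removals.extend(group[lo:hi])\n",
    "    return removals\n",
    "\n",
    "\n",
    "toy_duplicate_ids = [3, 8, 15, 16, 23, 42, 57, 58, 91, 120]  # made-up IDs, already sorted\n",
    "toy_row_groups = make_row_groups(toy_duplicate_ids, group_size=4)\n",
    "\n",
    "# Suppose a batch read from the input files covers the integer IDs 10..60\n",
    "print(find_removals(toy_row_groups, batch_min=10, batch_max=60))"
   ]
  },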
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11448dfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_curator.stages.deduplication.semantic import RankingStrategy\n",
    "from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow\n",
    "\n",
    "workflow = TextSemanticDeduplicationWorkflow(\n",
    "    input_path=input_path,\n",
    "    output_path=output_path,\n",
    "    cache_path=cache_path,\n",
    "    perform_removal=True,\n",
    "    # Embedding generation parameters\n",
    "    text_field=\"text\",\n",
    "    model_identifier=\"sentence-transformers/all-MiniLM-L6-v2\",\n",
    "    embedding_max_seq_length=512,\n",
    "    embedding_max_chars=None,\n",
    "    embedding_pooling=\"mean_pooling\",\n",
    "    embedding_model_inference_batch_size=256,\n",
    "    # K-Means clustering parameters\n",
    "    n_clusters=100,  # this number can be much higher when the data is large\n",
    "    # Semantic deduplication parameters\n",
    "    # For large-scale data removal, we should use CURATOR_DEDUP_ID_STR instead\n",
    "    id_field=\"id\",\n",
    "    eps=0.01,\n",
    "    ranking_strategy=RankingStrategy(metadata_cols=[\"cosine_dist_to_cent\"], ascending=True),\n",
    "    pairwise_batch_size=1024,\n",
    "    # ID generator parameters\n",
    "    # For large-scale data removal, we should set use_id_generator to True\n",
    "    use_id_generator=False,\n",
    "    id_generator_state_file=None,\n",
    "    # I/O parameters\n",
    "    input_filetype=input_filetype,\n",
    "    input_files_per_partition=1,\n",
    "    output_filetype=output_filetype,\n",
    "    verbose=True,\n",
    "    clear_output=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7672bd73",
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from nemo_curator.core.client import RayClient\n",
    "\n",
    "# Choose the number of GPUs so that total GPU memory is roughly 2x the size of the embeddings\n",
    "client = RayClient(num_cpus=64, num_gpus=4)\n",
    "client.start()\n",
    "try:\n",
    "    workflow.run()\n",
    "finally:\n",
    "    client.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cb61451",
   "metadata": {},
   "source": [
    "### Looking at Intermediate Results and Output"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb343b3",
   "metadata": {},
   "source": [
    "#### 1. Embeddings Results\n",
    "\n",
    "1. `id` : The ID field from our original dataset.\n",
    "    - For all subsequent steps this is assumed to be the `id_field`.\n",
    "    - If you set `use_id_generator=True`, you would see the `_curator_dedup_id` column here instead.\n",
    "        - The IDs in that column are generated by our `IdGenerator`, which assigns an integer ID to each row of the input data; these IDs are later used for removal.\n",
    "2. `embeddings` : The embedding generated by the model we used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ef31ee41",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>embeddings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>d079fca9-d29b-4dbb-9ad1-21c667151bde</td>\n",
       "      <td>[-0.12394736707210541, 0.010744917206466198, 0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>d5bb9128-2e2b-4499-9898-46e2c2059ab2</td>\n",
       "      <td>[-0.07273813337087631, 0.06685175746679306, 0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>240d44a9-52aa-4197-a440-7d3aa88d8182</td>\n",
       "      <td>[-0.04823765903711319, 0.11327654868364334, -0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>690365e4-380b-4f63-9c75-4eb5fc5affa6</td>\n",
       "      <td>[-0.08059918135404587, 0.024182168766856194, -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>15a0dc0c-fd96-44b3-9d41-a42441a979b4</td>\n",
       "      <td>[-0.031761009246110916, 0.02613956294953823, 0...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id  \\\n",
       "0  d079fca9-d29b-4dbb-9ad1-21c667151bde   \n",
       "1  d5bb9128-2e2b-4499-9898-46e2c2059ab2   \n",
       "2  240d44a9-52aa-4197-a440-7d3aa88d8182   \n",
       "3  690365e4-380b-4f63-9c75-4eb5fc5affa6   \n",
       "4  15a0dc0c-fd96-44b3-9d41-a42441a979b4   \n",
       "\n",
       "                                          embeddings  \n",
       "0  [-0.12394736707210541, 0.010744917206466198, 0...  \n",
       "1  [-0.07273813337087631, 0.06685175746679306, 0....  \n",
       "2  [-0.04823765903711319, 0.11327654868364334, -0...  \n",
       "3  [-0.08059918135404587, 0.024182168766856194, -...  \n",
       "4  [-0.031761009246110916, 0.02613956294953823, 0...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings_path = os.path.join(cache_path, \"embeddings\")\n",
    "\n",
    "pd.read_parquet(os.path.join(embeddings_path, os.listdir(embeddings_path)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd5057fe",
   "metadata": {},
   "source": [
    "#### 2. K-Means Results\n",
    "\n",
    "1. `id` : The IDs of the rows that belong to the cluster.\n",
    "2. `embeddings` : These are later used for pairwise similarity.\n",
    "3. `l2_dist_to_cent` / `cosine_dist_to_cent` : How far (in L2 distance or cosine distance) a sample is from its cluster's centroid.\n",
    "    - These fields let us define how rows are ranked within a cluster. See `RankingStrategy`.\n",
    "    - If other `metadata_fields` had been provided, they would be used here instead.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "02c7c9ab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>embeddings</th>\n",
       "      <th>l2_dist_to_cent</th>\n",
       "      <th>cosine_dist_to_cent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>9b4feddc-4d51-43fb-aeaa-18024a32d60c</td>\n",
       "      <td>[-0.051341612, 0.023956891, 0.0818636, 0.01455...</td>\n",
       "      <td>0.564844</td>\n",
       "      <td>0.174446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>988c12f1-4adc-4123-a1ea-872818bed68b</td>\n",
       "      <td>[-0.029556809, 0.040268984, 0.13631389, -0.001...</td>\n",
       "      <td>0.677120</td>\n",
       "      <td>0.261456</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>e8e4342d-2bed-4626-bd9e-222d0e347567</td>\n",
       "      <td>[-0.08078232, 0.032830615, 0.10732673, -0.0177...</td>\n",
       "      <td>0.545969</td>\n",
       "      <td>0.161363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>cffd02df-29d3-4d4d-9d19-332527f372d8</td>\n",
       "      <td>[-0.016866645, 0.03228915, 0.021343842, 0.0527...</td>\n",
       "      <td>0.599426</td>\n",
       "      <td>0.199569</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>f3dd6532-0abf-4ad9-bca7-514d78085a8f</td>\n",
       "      <td>[-0.04058464, 0.023736855, 0.09525293, 0.05758...</td>\n",
       "      <td>0.553145</td>\n",
       "      <td>0.166284</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id  \\\n",
       "0  9b4feddc-4d51-43fb-aeaa-18024a32d60c   \n",
       "1  988c12f1-4adc-4123-a1ea-872818bed68b   \n",
       "2  e8e4342d-2bed-4626-bd9e-222d0e347567   \n",
       "3  cffd02df-29d3-4d4d-9d19-332527f372d8   \n",
       "4  f3dd6532-0abf-4ad9-bca7-514d78085a8f   \n",
       "\n",
       "                                          embeddings  l2_dist_to_cent  \\\n",
       "0  [-0.051341612, 0.023956891, 0.0818636, 0.01455...         0.564844   \n",
       "1  [-0.029556809, 0.040268984, 0.13631389, -0.001...         0.677120   \n",
       "2  [-0.08078232, 0.032830615, 0.10732673, -0.0177...         0.545969   \n",
       "3  [-0.016866645, 0.03228915, 0.021343842, 0.0527...         0.599426   \n",
       "4  [-0.04058464, 0.023736855, 0.09525293, 0.05758...         0.553145   \n",
       "\n",
       "   cosine_dist_to_cent  \n",
       "0             0.174446  \n",
       "1             0.261456  \n",
       "2             0.161363  \n",
       "3             0.199569  \n",
       "4             0.166284  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans_path_first_centroid = os.path.join(cache_path, \"semantic_dedup\", \"kmeans_results\", \"centroid=0\")\n",
    "\n",
    "pd.read_parquet(os.path.join(kmeans_path_first_centroid, os.listdir(kmeans_path_first_centroid)[0])).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27aaac0b",
   "metadata": {},
   "source": [
    "#### 3. Pairwise Similarity Result\n",
    "\n",
    "1. `id` : The ID of a row in the cluster (a duplicate candidate).\n",
    "2. `max_id` : The ID of the most similar row found for `id`.\n",
    "3. `cosine_sim_score` : The cosine similarity between `id` and `max_id`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4f741101",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>max_id</th>\n",
       "      <th>cosine_sim_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>e348775f-27bc-4ad2-b772-bd8d490c6af4</td>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>0.932138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>9989bc6c-ed4c-43a1-9819-5c2cb137f736</td>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>0.936246</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>70314669-7849-4f71-aabb-6af954e80c42</td>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>0.915550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>dd2ec313-2904-43dd-a522-8d643b492a05</td>\n",
       "      <td>b6c8eb2e-bd73-4492-a927-f52d96c7a967</td>\n",
       "      <td>0.928383</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id                                max_id  \\\n",
       "0  b6c8eb2e-bd73-4492-a927-f52d96c7a967  b6c8eb2e-bd73-4492-a927-f52d96c7a967   \n",
       "1  e348775f-27bc-4ad2-b772-bd8d490c6af4  b6c8eb2e-bd73-4492-a927-f52d96c7a967   \n",
       "2  9989bc6c-ed4c-43a1-9819-5c2cb137f736  b6c8eb2e-bd73-4492-a927-f52d96c7a967   \n",
       "3  70314669-7849-4f71-aabb-6af954e80c42  b6c8eb2e-bd73-4492-a927-f52d96c7a967   \n",
       "4  dd2ec313-2904-43dd-a522-8d643b492a05  b6c8eb2e-bd73-4492-a927-f52d96c7a967   \n",
       "\n",
       "   cosine_sim_score  \n",
       "0          0.000000  \n",
       "1          0.932138  \n",
       "2          0.936246  \n",
       "3          0.915550  \n",
       "4          0.928383  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pairwise_path = os.path.join(cache_path, \"semantic_dedup\", \"pairwise_results\")\n",
    "\n",
    "pd.read_parquet(os.path.join(pairwise_path, \"cluster_0.parquet\")).head()"
   ]
  },
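  {
   "cell_type": "markdown",
   "id": "b92d04c7",
   "metadata": {},
   "source": [
    "The three columns above can be reproduced on toy data. The next cell is a hedged sketch of the SemDeDup idea, not NeMo Curator's code: rows in a cluster are taken in ranked order, each row is compared only with the rows ranked before it, and (as an assumption about the rule) a row is flagged as a duplicate when its best similarity is at least `1 - eps`. The helpers `cosine_sim` and `pairwise_max` and the toy vectors are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f07a63d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def cosine_sim(a, b):\n",
    "    \"\"\"Cosine similarity between two equal-length vectors.\"\"\"\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))\n",
    "\n",
    "\n",
    "def pairwise_max(ranked_items):\n",
    "    \"\"\"For each (id, vector) in ranked order, find the most similar higher-ranked item.\"\"\"\n",
    "    rows = []\n",
    "    for i, (item_id, vec) in enumerate(ranked_items):\n",
    "        # The top-ranked item has nothing to compare against, so it keeps itself with score 0.0\n",
    "        best_id, best_score = item_id, 0.0\n",
    "        for other_id, other_vec in ranked_items[:i]:\n",
    "            score = cosine_sim(vec, other_vec)\n",
    "            if score > best_score:\n",
    "                best_id, best_score = other_id, score\n",
    "        rows.append({\"id\": item_id, \"max_id\": best_id, \"cosine_sim_score\": best_score})\n",
    "    return rows\n",
    "\n",
    "\n",
    "toy_cluster = [(\"a\", [1.0, 0.0]), (\"b\", [0.999, 0.04]), (\"c\", [0.0, 1.0])]  # made-up embeddings\n",
    "toy_eps = 0.01  # assumed rule: flag a row when cosine_sim_score >= 1 - eps\n",
    "\n",
    "toy_rows = pairwise_max(toy_cluster)\n",
    "toy_duplicates = [row[\"id\"] for row in toy_rows if row[\"cosine_sim_score\"] >= 1 - toy_eps]\n",
    "print(toy_duplicates)"
   ]
  },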
  {
   "cell_type": "markdown",
   "id": "5ee21190",
   "metadata": {},
   "source": [
    "#### Looking at Similar Results\n",
    "\n",
    "We can look at two rows and see why the embedding model thinks they're similar.\n",
    "We can use this to guide our decision about the `eps` parameter.\n",
    "\n",
    "We use two `9989bc6c-ed4c-43a1-9819-5c2cb137f736` and\t`b6c8eb2e-bd73-4492-a927-f52d96c7a967` which have cosine similarity of\t`0.936246`.\n",
    "\n",
    "And we notice that the theme of the data is very similar. \n",
    "\n",
    "**NOTE: If you run with `use_id_generator=True` (which is important to perform removal at large scale) you will see the IDs which were generated internally, and there won't be a way to perform this step, as there is no simple way of getting the old mappings of IDs to the new mappings of IDs.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d1abf127",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'id': '9989bc6c-ed4c-43a1-9819-5c2cb137f736',\n",
      "  'text': 'Once upon a time, there was a little girl named Lily. She loved to '\n",
      "          'play outside in the lovely sunshine. One day, she went to the park '\n",
      "          'with her mommy. They walked on the ground and saw many flowers.\\n'\n",
      "          '\\n'\n",
      "          'Lily saw a boy playing with a ball. She wanted to play too, but she '\n",
      "          'was shy. Her mommy told her to ask the boy if she could play. Lily '\n",
      "          'was nervous, but she asked the boy if she could play with him. The '\n",
      "          'boy just shrugged and said \"okay\".\\n'\n",
      "          '\\n'\n",
      "          'Lily and the boy played together and had lots of fun. They kicked '\n",
      "          'the ball and ran around. When it was time to go home, Lily said '\n",
      "          \"goodbye to the boy. As they walked away, Lily's mommy told her that \"\n",
      "          'it was good she asked to play. Lily smiled and felt happy. She knew '\n",
      "          'she would always remember that lovely day at the park.'},\n",
      " {'id': 'b6c8eb2e-bd73-4492-a927-f52d96c7a967',\n",
      "  'text': 'Once upon a time, there was a little girl named Lily. She loved to '\n",
      "          'play with her toys all day long, but sometimes she felt lonely. One '\n",
      "          'day, her mommy said, \"Lily, let\\'s go to the park and play with '\n",
      "          'other kids.\" Lily was so happy and excited!\\n'\n",
      "          '\\n'\n",
      "          'At the park, Lily saw many kids playing, but they all seemed so '\n",
      "          \"distant from her. She didn't know how to join in. Suddenly, a kind \"\n",
      "          'boy came up to her and said, \"Hi, do you want to play with us?\" '\n",
      "          'Lily smiled and said, \"Yes, please!\" The boy provided her with a '\n",
      "          'ball, and they all started to play together.\\n'\n",
      "          '\\n'\n",
      "          \"Lily had so much fun playing with her new friends. She didn't want \"\n",
      "          'the time to end, but eventually, it was time to go home. Lily said '\n",
      "          'goodbye to her new friends, feeling happy and grateful for the fun '\n",
      "          'time they had together.'}]\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "pprint(\n",
    "    pd.read_parquet(\n",
    "        input_path,\n",
    "        filters=[(\"id\", \"in\", {\"9989bc6c-ed4c-43a1-9819-5c2cb137f736\", \"b6c8eb2e-bd73-4492-a927-f52d96c7a967\"})],\n",
    "    ).to_dict(orient=\"records\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c446a67",
   "metadata": {},
   "source": [
    "##### Visualizing Similarity in the Dataset\n",
    "\n",
    "Depending on the dataset size, we can read through all of the files and plot how similar the rows are to one another.\n",
    "Here we show how to read file by file and then perform a reduce.\n",
    "\n",
    "In our dataset we can see that ~20% of our data has a cosine similarity of 0.9 or more.\n",
    "\n",
    "Based on the analysis here and above (where we see similar text fields), we can decide what our `eps` should be.\n",
    "\n",
    "Note that in this tutorial we already ran the workflow with `eps` set and `perform_removal=True`.\n",
    "\n",
    "Ideally, users do this analysis, inspect the duplicates, come up with an `eps`, run a pipeline that includes the `IdentifyDuplicates` stage, and finally perform removal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "a548c4fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from functools import reduce\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def get_bins(df: pd.DataFrame, num_bins: int = 1_000) -> dict[float, int]:\n",
    "    \"\"\"Count the cosine similarity scores in each bin, keyed by the bin's upper edge.\"\"\"\n",
    "    bins = np.linspace(0, 1.01, num_bins)\n",
    "\n",
    "    return Counter(\n",
    "        pd.cut(df[\"cosine_sim_score\"], bins=bins, labels=bins[1:], retbins=False, include_lowest=True, right=True)\n",
    "        .value_counts()\n",
    "        .to_dict()\n",
    "    )\n",
    "\n",
    "\n",
    "# Sum the per-file bin counts into a single histogram for the whole dataset\n",
    "similarity_across_dataset = reduce(\n",
    "    lambda x, y: x + y,\n",
    "    [\n",
    "        get_bins(pd.read_parquet(os.path.join(pairwise_path, f), columns=[\"cosine_sim_score\"]), num_bins=1000)\n",
    "        for f in os.listdir(pairwise_path)\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "2db46cf5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAG2CAYAAACDLKdOAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjMsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvZiW1igAAAAlwSFlzAAAPYQAAD2EBqD+naQAAWkFJREFUeJzt3XtYlNXePvB7QGYQBUGRg0SCp9RUINjyohmWKKXbQ7XTopQw3TuVrTkZiQcQNTFTxNSiPGeZdjDbpZskDMpEMJS2eSpPYSp4SlDQYWDW7w9+TI4gzjPOmftzXVwvs+aZNTcj7+bbWt/neWRCCAEiIiIiO+Fg6QBERERExsTihoiIiOwKixsiIiKyKyxuiIiIyK6wuCEiIiK7wuKGiIiI7AqLGyIiIrIrLG6IiIjIrrC4ISIiIrvC4oaIiIjsikWLm++//x5Dhw5Fu3btIJPJsG3btru+JicnBw899BAUCgU6deqE9evXmzwnERER2Q6LFjcVFRUICgrCypUr9Tr+1KlTGDJkCB599FEUFRXhlVdewbhx4/DNN9+YOCkRERHZCpm13DhTJpPhiy++wIgRI+54zOuvv47t27fjl19+0Y49++yzuHr1KjIzM82QkoiIiKxdM0sHkCIvLw9RUVE6Y9HR0XjllVfu+BqVSgWVSqV9rNFocOXKFbRp0wYymcxUUYmIiMiIhBC4du0a2rVrBweHxjeebKq4KSkpgbe3t86Yt7c3ysvLcePGDTRv3rzea1JTU5GSkmKuiERERGRCZ86cwX333dfoMTZV3BgiMTERSqVS+7isrAz3338/Tp06BVdXV6O9j1qtxnfffYdHH30UTk5ORpvXlGwxM2CbuZnZPJjZPJjZPGwxc3nFTUQu3QMA+DHhEbjIjVdmXLt2DYGBgXr97bap4sbHxwelpaU6Y6WlpXBzc2tw1QYAFAoFFApFvfHWrVvDzc3NaNnUajVcXFzQpk0bm/kltMXMgG3mZmbzYGbzYGbzsMXMzZxvwEHhAgBo06aNUYubus9An5YSm7rOTUREBLKzs3XGsrKyEBERYaFEREREZG0sWtxcv34dRUVFKCoqAlB7qndRURGKi4sB1G4pjRkzRnv8yy+/jJMnTyIhIQFHjx7FO++8g08++QRTp061RHwiIiKyQhYtbn766SeEhIQgJCQEAKBUKhESEoKkpCQAwPnz57WFDgAEBgZi+/btyMrKQlBQEJYsWYLVq1cjOjraIvmJiIjI+li056Z///5o7DI7DV19uH///jhw4IAJUxEREZEts6meGyIiIqK7YXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2RUWN0RERGRXWNwQERGRXWFxQ0RERHaFxQ0RERHZFRY3REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2RUWN0RERGRXWNwQERGRXWFxQ0RERHaFxQ0RERHZFRY3REREZFdY3BAREZFdYXFDREREdsXixc3KlSsREBAAZ2dnhIeHo6Cg4I7HqtVqzJ07Fx07doSzszOCgoKQmZlpxrRERERk7Sxa3GzZsgVKpRLJycnYv38/goKCEB0djQsXLjR4/KxZs/Dee+9h+fLlOHz4MF5++WU8+eSTOHDggJmTExERkbWyaHGTlpaG8ePHIy4uDt27d0dGRgZcXFywdu3aBo/fuHEjZsyYgcGDB6NDhw6YMGECBg8ejCVLlpg5OREREVmrZpZ646qqKhQWFiIxMVE75uDggKioKOTl5TX4GpVKBWdnZ52x5s2bY/fu3Xd8H5VKBZVKpX1cXl4OoHaLS61W38uPoKNuLmPOaWq2mBmwzdzMbB7MbB7MbB62mbn6lu/VUMuEEefW/3OQCSGM984SnDt3Dn5+ftizZw8iIiK04wkJCcjNzUV+fn6918TExODnn3/Gtm3b
0LFjR2RnZ2P48OGoqanRKWBuNWfOHKSkpNQb37RpE1xcXIz3AxERETVxqhogoaB23WRR72ooHI03d2VlJWJiYlBWVgY3N7dGj7XYyo0hli1bhvHjx6Nr166QyWTo2LEj4uLi7riNBQCJiYlQKpXax+Xl5fD398egQYPu+uFIoVarkZWVhYEDB8LJyclo85qSLWYGbDM3M5sHM5sHM5uHLWYuq7gJFHwPAIiOHgQXufHKjLqdF33c07vevHmz3jaRvjw9PeHo6IjS0lKd8dLSUvj4+DT4mrZt22Lbtm24efMmLl++jHbt2mH69Ono0KHDHd9HoVBAoVDUG3dycjLJL4up5jUlW8wM2GZuZjYPZjYPZjYPW8rs5FR9y/dOcHIyXnEj5TOQ3FCs0Wgwb948+Pn5oWXLljh58iQAYPbs2VizZo3e88jlcoSGhiI7O1tn7uzsbJ1tqoY4OzvDz88P1dXV+PzzzzF8+HCpPwYRERHZKcnFzfz587F+/XosWrQIcrlcO96jRw+sXr1a0lxKpRKrVq3Chg0bcOTIEUyYMAEVFRWIi4sDAIwZM0an4Tg/Px9bt27FyZMn8cMPP+Dxxx+HRqNBQkKC1B+DiIiI7JTk9aIPPvgA77//PgYMGICXX35ZOx4UFISjR49KmmvUqFG4ePEikpKSUFJSguDgYGRmZsLb2xsAUFxcDAeHv+qvmzdvYtasWTh58iRatmyJwYMHY+PGjXB3d5f6YxAREZGdklzcnD17Fp06dao3rtFoDDpdLT4+HvHx8Q0+l5OTo/M4MjIShw8flvweRERE1HRI3pbq3r07fvjhh3rjn332GUJCQowSioiIiMhQkldukpKSEBsbi7Nnz0Kj0WDr1q04duwYPvjgA3z99demyEhERESkN8krN8OHD8dXX32Fb7/9Fi1atEBSUhKOHDmCr776CgMHDjRFRiIiIiK9SVq5qa6uxoIFCzB27FhkZWWZKhMRERGRwSSt3DRr1gyLFi1CdXX13Q8mIiIisgDJ21IDBgxAbm6uKbIQERER3TPJDcVPPPEEpk+fjoMHDyI0NBQtWrTQeX7YsGFGC0dEREQkleTiZuLEiQCAtLS0es/JZDLU1NTceyoiIiIiA0kubjQajSlyEBERERmF5J4bIiIiImtm0L3Ic3NzsXjxYhw5cgRA7VWLX3vtNfTr18+o4YiIiMg4hBC4oTZt68iNKutoTZFc3Hz44YeIi4vDU089hcmTJwMAfvzxRwwYMADr169HTEyM0UMSERGRNLcWM0IAz2Tk4fD5cgunMg/Jxc0bb7yBRYsWYerUqdqxyZMnIy0tDfPmzWNxQ0REZCF1BY2li5nQ+93R3MnRIu8NGFDcnDx5EkOHDq03PmzYMMyYMcMooYiIiEgajUbg78t3N1rQdPd1w6cvR0AmM00GtVqNb77ZiRF//xtkpnoTPUgubvz9/ZGdnY1OnTrpjH/77bfw9/c3WjAiIiLSj0YjMCAtF6cuVeiM317MNHdyNGnRoZYJKBxh0cIGMKC4efXVVzF58mQUFRWhT58+AGp7btavX49ly5YZPSARERE1TAiByqoa/H35bm1hE+jZAl//+2HIZKYvZqyV5OJmwoQJ8PHxwZIlS/DJJ58AALp164YtW7Zg+PDhRg9IRERE9TW0DRXo2QLZykg4ODS9guZWBp0K/uSTT+LJJ580dhYiIiLSQ0PbUN193fD1vx9u8oUNYEBxs2/fPmg0GoSHh+uM5+fnw9HREWFhYUYLR0RERLqEEA1uQ7nIm+YWVEMkX6F40qRJOHPmTL3xs2fPYtKkSUYJRURERA2rrKrRbkXVbUO1UDRjYXMLycXN4cOH8dBDD9UbDwkJweHDh40SioiIiOqr67Opw22ohkkubhQKBUpLS+uNnz9/Hs2aGdTCQ0RERHdx+3ZUd183uMgtd6E8aya5uBk0aBASExNRVlamHbt69SpmzJiBgQMHGjUcERER1bp9O6r2dG+u2jRE8lLL
4sWL8cgjj6B9+/YICQkBABQVFcHb2xsbN240ekAiIqKmTgiBZzLytI+5HdU4ycWNn58f/ve//+Gjjz7Czz//jObNmyMuLg7PPfccnJycTJGRiIioSbt11YbbUXdnUJNMixYt8M9//tPYWYiIiOg2tU3EP2of195Ogas2jZHcc7NhwwZs375d+zghIQHu7u7o06cPfv/9d8kBVq5ciYCAADg7OyM8PBwFBQWNHp+eno4HHngAzZs3h7+/P6ZOnYqbN29Kfl8iIiJrJwQw4t29bCKWSHJxs2DBAjRv3hwAkJeXhxUrVmDRokXw9PTE1KlTJc21ZcsWKJVKJCcnY//+/QgKCkJ0dDQuXLjQ4PGbNm3C9OnTkZycjCNHjmDNmjXYsmUL70ZORER2qUoDHCm5BoBNxFJILm7OnDmjvSP4tm3b8I9//AP//Oc/kZqaih9++EHSXGlpaRg/fjzi4uLQvXt3ZGRkwMXFBWvXrm3w+D179qBv376IiYlBQEAABg0ahOeee+6uqz1ERES2jk3E+pPcc9OyZUtcvnwZ999/P3bu3AmlUgkAcHZ2xo0bN/Sep6qqCoWFhUhMTNSOOTg4ICoqCnl5eQ2+pk+fPvjwww9RUFCA3r174+TJk9ixYwdGjx59x/dRqVRQqVTax+XltQ1ZarUaarVa77x3UzeXMec0NVvMDNhmbmY2D2Y2D2Y2j6qqKiz75a8tqOpqNdQOwoKJ7s6Un7OUOWVCCEmf1PPPP4+jR48iJCQEH3/8MYqLi9GmTRv85z//wYwZM/DLL7/oNc+5c+fg5+eHPXv2ICIiQjuekJCA3Nxc5OfnN/i6t99+G9OmTYMQAtXV1Xj55Zfx7rvv3vF95syZg5SUlHrjmzZtgouLi15ZiYiIzE1VAyQU1K5B+LkIvNarBk15R6qyshIxMTEoKyuDm5tbo8dKXrlZuXIlZs2ahTNnzuDzzz9HmzZtAACFhYV47rnnDEusp5ycHCxYsADvvPMOwsPDcfz4cUyZMgXz5s3D7NmzG3xNYmKidnUJqF258ff3x6BBg+764UihVquRlZWFgQMH2swp8baYGbDN3MxsHsxsHsxsekIIDHsnD8B1AMB25QC0UFj/XQBM+TnX7bzoQ/In5e7ujhUrVtQbb2h1pDGenp5wdHSsdyuH0tJS+Pj4NPia2bNnY/To0Rg3bhwAoGfPnqioqMA///lPzJw5Ew4O9VuIFAoFFApFvXEnJyeT/IKbal5TssXMgG3mZmbzYGbzYGbTqVBV42hJbWHTzccVrVo421QjsSk+ZynzSW4oNha5XI7Q0FBkZ2drxzQaDbKzs3W2qW5VWVlZr4BxdKzdj5S4u0ZERGSVbr8a8cfj/mZThY01sOgal1KpRGxsLMLCwtC7d2+kp6ejoqICcXFxAIAxY8bAz88PqampAIChQ4ciLS0NISEh2m2p2bNnY+jQodoih4iIyJbdUP91NWI/F8Hr2hjAosXNqFGjcPHiRSQlJaGkpATBwcHIzMyEt7c3AKC4uFhnpWbWrFmQyWSYNWsWzp49i7Zt22Lo0KF44403LPUjEBERGdWtGxFTetRw1cYAFu9Oio+PR3x8fIPP5eTk6Dxu1qwZkpOTkZycbIZkRERE5nX7lhQZRnLPTXJyskG3WSAiIqLG3XqDzG4+rpBbrDPWtkn+2L788kt07NgRAwYMwKZNm3QukEdERESGabiR2IKBbJjk4qaoqAj79u3Dgw8+iClTpsDHxwcTJkzAvn37TJGPiIioSbi1kZg3yLw3Bi14hYSE4O2338a5c+ewZs0a/PHHH+jbty969eqFZcuWoayszNg5iYiI7NqtjcSfvhzBRuJ7cE+7eUIIqNVqVFVVQQgBDw8PrFixAv7+/tiyZYuxMhIREdm127ekWNfcG4OKm8LCQsTHx8PX1xdTp05FSEgIjhw5gtzcXPz222944403MHnyZGNnJSIisku3b0k1d+KW1L2QXNz07NkT//d//4dTp05h
zZo1OHPmDBYuXIhOnTppj3nuuedw8eJFowYlIiJqCrglde8kX+dm5MiRGDt2LPz8/O54jKenJzQazT0FIyIiaipu7bdhXXPvJK/c1PXW3O7GjRuYO3euUUIRERE1Fbxwn/FJLm5SUlJw/fr1euOVlZWS7wxORETU1LHfxvgMWrlpaC/w559/RuvWrY0SioiIqKngKeDGp3fPjYeHB2QyGWQyGbp06aLz4dfU1OD69et4+eWXTRKSiIjIHvEUcNPQu7hJT0+HEAJjx45FSkoKWrVqpX1OLpcjICAAERERJglJRERkj7glZRp6FzexsbEAgMDAQPTp0wdOTk4mC0VERNTUcEvKePQqbsrLy+Hm5gag9tYLN27cwI0bNxo8tu44IiIiahxPATcNvYobDw8PnD9/Hl5eXnB3d2+wsqxrNK6pqTF6SCIiInvDU8BNR6/iZteuXdozob777juTBiIiImoK2G9jOnoVN5GRkQCA6upq5ObmYuzYsbjvvvtMGoyIiMie8RRw05F0nZtmzZrhrbfeQnV1tanyEBER2T2eAm5aki/i99hjjyE3N9cUWYiIiJoEbkmZluQbZz7xxBOYPn06Dh48iNDQULRo0ULn+WHDhhktHBERkT3ilpRpSS5uJk6cCABIS0ur9xzPliIiImoct6RMT3Jxo9FoTJGDiIioSeCWlOlJ7rkhIiIi4+CWlGlIXrkBgIqKCuTm5qK4uBhVVVU6z02ePNkowYiIiOwd6xrTkFzcHDhwAIMHD0ZlZSUqKirQunVrXLp0CS4uLvDy8mJxQ0RERBYleVtq6tSpGDp0KP788080b94ce/fuxe+//47Q0FAsXrzYoBArV65EQEAAnJ2dER4ejoKCgjse279/f8hksnpfQ4YMMei9iYiIzOnWM6XINCQXN0VFRXj11Vfh4OAAR0dHqFQq+Pv7Y9GiRZgxY4bkAFu2bIFSqURycjL279+PoKAgREdH48KFCw0ev3XrVpw/f1779csvv8DR0RHPPPOM5PcmIiIyJ95PyjwkFzdOTk5wcKh9mZeXF4qLiwEArVq1wpkzZyQHSEtLw/jx4xEXF4fu3bsjIyMDLi4uWLt2bYPHt27dGj4+PtqvrKwsuLi4sLghIiKrxzOlzENyz01ISAj27duHzp07IzIyEklJSbh06RI2btyIHj16SJqrqqoKhYWFSExM1I45ODggKioKeXn6VbZr1qzBs88+W+9ignVUKhVUKpX2cXl57S+VWq2GWq2WlLcxdXMZc05Ts8XMgG3mZmbzYGbzYOZ7yfHX7Ys2vRTW6O2MrCWzFKbMLGVOmRDSdv9++uknXLt2DY8++iguXLiAMWPGYM+ePejcuTPWrl2LoKAgvec6d+4c/Pz8sGfPHkRERGjHExISkJubi/z8/EZfX1BQgPDwcOTn56N3794NHjNnzhykpKTUG9+0aRNcXFz0zkpERHSvVDVAQkHtusKi3tVQcOFGb5WVlYiJiUFZWRnc3NwaPVbyyk1YWJj2ey8vL2RmZkpPaCRr1qxBz54971jYAEBiYiKUSqX2cXl5Ofz9/TFo0KC7fjhSqNVqZGVlYeDAgXBycjLavKZki5kB28zNzObBzObBzIYRQmD4O3sBXAMAREcPgov8zn+GrSGzVKbMXLfzog+DrnNjLJ6ennB0dERpaanOeGlpKXx8fBp9bUVFBTZv3oy5c+c2epxCoYBCoag37uTkZJJfFlPNa0q2mBmwzdzMbB7MbB7MLE1lVTWOlNQWNt193eDm4qzXBfz4Of81p770Km5CQkL0voLi/v379X5zuVyO0NBQZGdnY8SIEQBqb++QnZ2N+Pj4Rl/76aefQqVS4YUXXtD7/YiIiKwBr0xsWnoVN3WFhykolUrExsYiLCwMvXv3Rnp6OioqKhAXFwcAGDNmDPz8/JCamqrzujVr1mDEiBFo06aNybIREREZy60drqxrTEuv4iY5OdlkAUaNGoWLFy8iKSkJJSUlCA4ORmZm
Jry9vQEAxcXF2lPP6xw7dgy7d+/Gzp07TZaLiIjIWHh9G/OyaM9Nnfj4+DtuQ+Xk5NQbe+CBByDxJC8iIiKL4fVtzEuv4qZ169b49ddf4enpCQ8Pj0b3Ca9cuWK0cERERPaG/Tamp1dxs3TpUri6ugIA0tPTTZmHiIjI7rDfxrz0Km5iY2Mb/J6IiIgax34b8zO45+bChQu4cOECNBqNznivXr3uORQREZG9YL+N+UkubgoLCxEbG4sjR47Ua+qVyWSoqakxWjgiIiJ7wn4b85Bc3IwdOxZdunTBmjVr4O3tzX8kIiKiRrDfxvwkFzcnT57E559/jk6dOpkiDxERkd1gv41lONz9EF0DBgzAzz//bIosREREdoX9NpYheeVm9erViI2NxS+//IIePXrUu5HVsGHDjBaOiIjIXrDfxnwkFzd5eXn48ccf8d///rfec2woJiIiahjrGvORvC3173//Gy+88ALOnz8PjUaj88XChoiIiCxNcnFz+fJlTJ06VXtjSyIiIiJrIrm4eeqpp/Ddd9+ZIgsREZFd4T2eLUNyz02XLl2QmJiI3bt3o2fPnvUaiidPnmy0cERERLaKp4FbjkFnS7Vs2RK5ubnIzc3VeU4mk7G4ISIiAk8DtyTJxc2pU6dMkYOIiMhu8TRw85Lcc0NERER3x9suWI5eKzdKpRLz5s1DixYtoFQqGz02LS3NKMGIiIhsFfttLEuv4ubAgQNQq9Xa7++ES25ERETst7E0vYqbW0/95mngRERE+mO/jfndc89NeXk5tm3bhqNHjxojDxERkV1hXWN+koubkSNHYsWKFQCAGzduICwsDCNHjkTPnj3x+eefGz0gERGRreHF+yxLcnHz/fffo1+/fgCAL774AkIIXL16FW+//Tbmz59v9IBERES2hM3Elie5uCkrK0Pr1q0BAJmZmXj66afh4uKCIUOG4LfffjN6QCIiIlvCZmLLk1zc+Pv7Iy8vDxUVFcjMzMSgQYMAAH/++SecnZ2NHpCIiMhWsZnYMiRfofiVV17B888/j5YtW6J9+/bo378/gNrtqp49exo7HxERkc1iXWMZklduJk6ciL1792Lt2rXYvXs3HBxqp+jQoYNBPTcrV65EQEAAnJ2dER4ejoKCgkaPv3r1KiZNmgRfX18oFAp06dIFO3bskPy+REREpsBmYsuTvHIDAKGhoQgNDdUZGzJkiOR5tmzZAqVSiYyMDISHhyM9PR3R0dE4duwYvLy86h1fVVWFgQMHwsvLC5999hn8/Pzw+++/w93d3ZAfg4iIyKjYTGwdDCpujCUtLQ3jx49HXFwcACAjIwPbt2/H2rVrMX369HrHr127FleuXMGePXvg5OQEAAgICDBnZCIiojtiM7F1sFhxU1VVhcLCQiQmJmrHHBwcEBUVhby8hqve//znP4iIiMCkSZPw5Zdfom3btoiJicHrr78OR8eGf4FUKhVUKpX2cXl57S+dWq3W3lLCGOrmMuacpmaLmQHbzM3M5sHM5sHMjb1Ptfb7TS+Fobq6upGj7zYXP+eG5taHTAjL7A6eO3cOfn5+2LNnDyIiIrTjCQkJyM3NRX5+fr3XdO3aFadPn8bzzz+PiRMn4vjx45g4cSImT56M5OTkBt9nzpw5SElJqTe+adMmuLi4GO8HIiKiJk9VAyQU1K4bLOpdDQUXboymsrISMTExKCsrg5ubW6PHWnRbSiqNRgMvLy+8//77cHR0RGhoKM6ePYu33nrrjsVNYmKizp3My8vL4e/vj0GDBt31w5FCrVYjKysLAwcO1G6ZWTtbzAzYZm5mNg9mNg9mvrMKVTUSCnYBAKKjB8FFbvifWX7Ouup2XvRh0Kf+ww8/4L333sOJEye0jb0bN25EYGAgHn74Yb3m8PT0hKOjI0pLS3XGS0tL4ePj0+BrfH194eTkpLMF1a1bN5SUlKCqqgpyubzeaxQKBRQKRb1xJycnk/yymGpeU7LFzIBt5mZm82Bm82BmXUII
xLyz97b3uvc1BH7Of82pL8mngn/++eeIjo5G8+bNceDAAW0/S1lZGRYsWKD3PHK5HKGhocjOztaOaTQaZGdn62xT3apv3744fvw4NBqNduzXX3+Fr69vg4UNERGRubCZ2HpILm7mz5+PjIwMrFq1SqeK6tu3L/bv3y9pLqVSiVWrVmHDhg04cuQIJkyYgIqKCu3ZU2PGjNFpOJ4wYQKuXLmCKVOm4Ndff8X27duxYMECTJo0SeqPQUREZDK8MrFlSV4vO3bsGB555JF6461atcLVq1clzTVq1ChcvHgRSUlJKCkpQXBwMDIzM+Ht7Q0AKC4u1l4kEKi99cM333yDqVOnolevXvDz88OUKVPw+uuvS/0xiIiITIZ1jWVJLm58fHxw/PjxeteX2b17Nzp06CA5QHx8POLj4xt8Licnp95YREQE9u7dW/9gIiIiIhiwLTV+/HhMmTIF+fn5kMlkOHfuHD766CNMmzYNEyZMMEVGIiIiIr1JXrmZPn06NBoNBgwYgMrKSjzyyCNQKBSYNm0a/v3vf5siIxERkdXjPaWsh+TiRiaTYebMmXjttddw/PhxXL9+Hd27d0fLli1NkY+IiMjq8Z5S1sXgE/Dlcjm6d+9uzCxEREQ2iaeBWxfJxU1FRQUWLlyI7OxsXLhwQeeaMwBw8uRJo4UjIiKyNTwN3PIkFzfjxo1Dbm4uRo8eDV9fX/4DEhER3YJ/Fi1PcnHz3//+F9u3b0ffvn1NkYeIiMjmsJnYukg+FdzDwwOtW7c2RRYiIiKbw2Zi6yO5uJk3bx6SkpJQWVlpijxEREQ2hc3E1kevbamQkBCd3prjx4/D29sbAQEB9e7SKfX+UkRERPaCzcTWQa/iZsSIESaOQUREZPtY11gHvYqb5ORkU+cgIiKySWwmtj6Se246dOiAy5cv1xu/evWqQTfOJCIislVsJrZOkoub06dPo6ampt64SqXCH3/8YZRQREREtoDNxNZJ7+vc/Oc//9F+/80336BVq1baxzU1NcjOzkZgYKBx0xEREdkINhNbD72Lm7qmYplMhtjYWJ3nnJycEBAQgCVLlhg1HBERka1gXWM99C5u6u4hFRgYiH379sHT09NkoYiIiGwBm4mtk+TbL5w6dcoUOYiIiGwKm4mtl+SGYiIiImIzsTVjcUNERHSP2ExsXVjcEBER3SPWNdaFxQ0REZEB2ExsvSQXN2PGjMG6detw4sQJU+QhIiKyemwmtm6Sixu5XI7U1FR07twZ/v7+eOGFF7B69Wr89ttvpshHRERkddhMbN0kFzerV6/Gr7/+ijNnzmDRokVo2bIllixZgq5du+K+++4zRUYiIiKrxWZi62Nwz42HhwfatGkDDw8PuLu7o1mzZmjbtq0xsxEREVk91jXWR3JxM2PGDPTp0wdt2rTB9OnTcfPmTUyfPh0lJSU4cOCAQSFWrlyJgIAAODs7Izw8HAUFBXc8dv369ZDJZDpfzs7OBr0vERER2R/JVyheuHAh2rZti+TkZDz11FPo0qXLPQXYsmULlEolMjIyEB4ejvT0dERHR+PYsWPw8vJq8DVubm44duyY9jGXA4mIiKiO5JWbAwcOYObMmSgoKEDfvn3h5+eHmJgYvP/++/j1118lB0hLS8P48eMRFxeH7t27IyMjAy4uLli7du0dXyOTyeDj46P98vb2lvy+REREZJ8kr9wEBQUhKCgIkydPBgD8/PPPWLp0KSZNmgSNRoOamhq956qqqkJhYSESExO1Yw4ODoiKikJe3p1Psbt+/Trat28PjUaDhx56CAsWLMCDDz7Y4LEqlQoqlUr7uLy8trtdrVZDrVbrnfVu6uYy5pymZouZAdvMzczmwczmwcxAVVW1ztxqmfEvesPPueG59SETQtpliIQQOHDgAHJycpCTk4Pdu3ejvLwcvXr1QmRkJJYuXar3XOfOnYOfnx/27NmDiIgI7XhCQgJyc3ORn59f7zV5eXn47bff0KtXL5SVlWHx4sX4
/vvvcejQoQbP1pozZw5SUlLqjW/atAkuLi56ZyUiIgJqL9731v8ccbaytiViUe9qKHgmuMlVVlYiJiYGZWVlcHNza/RYycWNh4cHrl+/jqCgIERGRqJ///7o168f3N3dJQc1pLi5nVqtRrdu3fDcc89h3rx59Z5vaOXG398fly5duuuHI4VarUZWVhYGDhwIJycno81rSraYGbDN3MxsHsxsHk09c2VVNYLm7QIAdPNxxZcT/88kvZ9N/XO+XXl5OTw9PfUqbiRvS3344Yfo16+fUQoDT09PODo6orS0VGe8tLQUPj4+es3h5OSEkJAQHD9+vMHnFQoFFApFg68zxS+LqeY1JVvMDNhmbmY2D2Y2j6aa2Un8Vch8NqEP5HLJf0qlvV8T/ZwbmlNfkhuKhwwZoi1s/vjjD/zxxx9Sp9CSy+UIDQ1Fdna2dkyj0SA7O1tnJacxNTU1OHjwIHx9fQ3OQUREZAierGudJBc3Go0Gc+fORatWrdC+fXu0b98e7u7umDdvHjQajeQASqUSq1atwoYNG3DkyBFMmDABFRUViIuLA1B7L6tbG47nzp2LnTt34uTJk9i/fz9eeOEF/P777xg3bpzk9yYiIiL7I3ktbebMmVizZg0WLlyIvn37AgB2796NOXPm4ObNm3jjjTckzTdq1ChcvHgRSUlJKCkpQXBwMDIzM7WndxcXF8PB4a8a7M8//8T48eNRUlICDw8PhIaGYs+ePejevbvUH4WIiEgy3g3c+kkubjZs2IDVq1dj2LBh2rFevXrBz88PEydOlFzcAEB8fDzi4+MbfC4nJ0fn8dKlSyWdkUVERGQsvBu4bZC8LXXlyhV07dq13njXrl1x5coVo4QiIiKyRrwbuG2QXNwEBQVhxYoV9cZXrFiBoKAgo4QiIiKydrwbuPWSvC21aNEiDBkyBN9++632jKa8vDycOXMGO3bsMHpAIiIia8S6xnpJXrmJjIzEr7/+iieffBJXr17F1atX8dRTT+HYsWPo16+fKTISERFZBTYT2waDrjzUrl07gxqHiYiIbBWbiW2HXsXN//73P70n7NWrl8FhiIiIrBWbiW2HXsVNcHAwZDIZ7nYbKplMJumu4ERERLaIzcTWTa/i5tSpU6bOQUREZDNY11g3vYqb9u3bmzoHERGRVWMzse2QfLYUAGzcuBF9+/ZFu3bt8PvvvwMA0tPT8eWXXxo1HBERkTVgM7FtkVzcvPvuu1AqlRg8eDCuXr2q7bFxd3dHenq6sfMRERFZHJuJbYvk4mb58uVYtWoVZs6cCUfHv/5xw8LCcPDgQaOGIyIisjZsJrZ+koubU6dOISQkpN64QqFARUWFUUIRERFZK9Y11k9ycRMYGIiioqJ645mZmejWrZsxMhEREVkVNhPbFslXKFYqlZg0aRJu3rwJIQQKCgrw8ccfIzU1FatXrzZFRiIiIothM7HtkVzcjBs3Ds2bN8esWbNQWVmJmJgYtGvXDsuWLcOzzz5rioxEREQWw2Zi22PQvaWef/55PP/886isrMT169fh5eVl7FxERERWh83EtsGg4gYALly4gGPHjgGove1C27ZtjRaKiIjIGrGusQ2SG4qvXbuG0aNHo127doiMjERkZCTatWuHF154AWVlZabISERERKQ3ycXNuHHjkJ+fj+3bt+Pq1au4evUqvv76a/z000/417/+ZYqMRERERHqTvC319ddf45tvvsHDDz+sHYuOjsaqVavw+OOPGzUcERERkVSSV27atGmDVq1a1Rtv1aoVPDw8jBKKiIjIWvAaN7ZHcnEza9YsKJVKlJSUaMdKSkrw2muvYfbs2UYNR0REZEm8xo1t0mtbKiQkROfUt99++w33338/7r//fgBAcXExFAoFLl68yL4bIiKyG7zGjW3Sq7gZMWKEiWMQERFZN17jxnboVdwkJyebOgcREZFVY11jOyT33JjCypUrERAQAGdnZ4SHh6OgoECv123evBkymYwrS0REZBJs
JrZNFi9utmzZAqVSieTkZOzfvx9BQUGIjo7GhQsXGn3d6dOnMW3aNPTr189MSYmIqClhM7Htsnhxk5aWhvHjxyMuLg7du3dHRkYGXFxcsHbt2ju+pqamBs8//zxSUlLQoUMHM6YlIqKmgs3Etsvge0sZQ1VVFQoLC5GYmKgdc3BwQFRUFPLy7lwtz507F15eXnjppZfwww8/NPoeKpUKKpVK+7i8vPYXVa1WQ61W3+NP8Je6uYw5p6nZYmbANnMzs3kws3k0lcxqdbX2+00vhaG6urqRo42vqXzOUufWh0wIaTuKc+fOxbRp0+Di4qIzfuPGDbz11ltISkrSe65z587Bz88Pe/bsQUREhHY8ISEBubm5yM/Pr/ea3bt349lnn0VRURE8PT3x4osv4urVq9i2bVuD7zFnzhykpKTUG9+0aVO9n4GIiKiOqgZIKKhdA1jUuxoKLtxYVGVlJWJiYlBWVgY3N7dGj5Vc3Dg6OuL8+fPw8vLSGb98+TK8vLxQU1Oj91xSi5tr166hV69eeOedd/DEE08AwF2Lm4ZWbvz9/XHp0qW7fjhSqNVqZGVlYeDAgXBycjLavKZki5kB28zNzObBzObRVDJXqKoRPH8XAODn2Y/BRW7ezY6m8jnrq7y8HJ6ennoVN5L/pYQQDZ7n//PPP6N169aS5vL09ISjoyNKS0t1xktLS+Hj41Pv+BMnTuD06dMYOnSodkyj0QAAmjVrhmPHjqFjx446r1EoFFAoFPXmcnJyMskvi6nmNSVbzAzYZm5mNg9mNg97ziyEQMw7e297nWU6Oez5c5Y6p770/pfy8PCATCaDTCZDly5ddAqcmpoaXL9+HS+//LKkoHK5HKGhocjOztaezq3RaJCdnY34+Ph6x3ft2hUHDx7UGZs1axauXbuGZcuWwd/fX9L7ExERNYTNxLZN7+ImPT0dQgiMHTsWKSkpOjfPlMvlCAgI0Nla0pdSqURsbCzCwsLQu3dvpKeno6KiAnFxcQCAMWPGwM/PD6mpqXB2dkaPHj10Xu/u7g4A9caJiIiMgVcmtj16FzexsbEAgMDAQPTt2xfNmhlneW7UqFG4ePEikpKSUFJSguDgYGRmZsLb2xtA7X2rHBwsfsY6ERE1UaxrbI/kCiUyMhInTpzAunXrcOLECSxbtgxeXl7473//i/vvvx8PPvig5BDx8fENbkMBQE5OTqOvXb9+veT3IyIiagyvTGzbJC+J5ObmomfPnsjPz8fWrVtx/fp1ALUNxbwHFRER2Tpemdj2SS5upk+fjvnz5yMrKwtyuVw7/thjj2Hv3r2NvJKIiMj6sZnY9kkubg4ePIgnn3yy3riXlxcuXbpklFBERETWgM3EtklycePu7o7z58/XGz9w4AD8/PyMEoqIiMgasK6xTZKLm2effRavv/46SkpKIJPJoNFo8OOPP2LatGkYM2aMKTISERGZDZuJbZ/k4mbBggXo2rUr/P39cf36dXTv3h2PPPII+vTpg1mzZpkiIxERkVmwmdg+SD4VXC6XY9WqVUhKSsLBgwdx/fp1hISEoHPnzqbIR0REZDZsJrYPBl+Jz9/fH/7+/qipqcHBgwfx559/wsPDw5jZiIiILIbNxLZL8rbUK6+8gjVr1gCovadUZGQkHnroIfj7+9/1gntERES2gnWN7ZJc3Hz22WcICgoCAHz11Vc4efIkjh49iqlTp2LmzJlGD0hERGQubCa2D5KLm0uXLsHHxwcAsGPHDowcORJdunTB2LFj692xm4iIyFawmdh+SC5uvL29cfjwYdTU1CAzMxMDBw4EAFRWVsLRkY1XRERkm9hMbD8kNxTHxcVh5MiR8PX1hUwmQ1RUFAAgPz8fXbt2NXpAIiIic2MzsW2TXNzMmTMHPXr0wJkzZ/DMM89AoVAAABwdHTF9+nSjByQiIjI31jW2zaBTwf/xj3/UG4uNjb3nMERERET3yqDipqKiArm5uSguLkZVVZXOc5MnTzZKMCIiIiJD
SC5uDhw4gMGDB6OyshIVFRVo3bo1Ll26BBcXF3h5ebG4ISIim8TTwO2H5LOlpk6diqFDh+LPP/9E8+bNsXfvXvz+++8IDQ3F4sWLTZGRiIjIpHgauH2RXNwUFRXh1VdfhYODAxwdHaFSqeDv749FixZhxowZpshIRERkUjwN3L5ILm6cnJzg4FD7Mi8vLxQXFwMAWrVqhTNnzhg3HRERkZnxNHDbJ7nnJiQkBPv27UPnzp0RGRmJpKQkXLp0CRs3bkSPHj1MkZGIiMhsWNfYPskrNwsWLICvry8A4I033oCHhwcmTJiAixcv4r333jN6QCIiIlNjM7F9kbxyExYWpv3ey8sLmZmZRg1ERERkTmwmtj+SV24ee+wxXL16td54eXk5HnvsMWNkIiIiMhs2E9sfycVNTk5OvQv3AcDNmzfxww8/GCUUERGRJbCZ2D7ovS31v//9T/v94cOHUVJSon1cd4dwPz8/46YjIiIysVv7bVjX2Ae9V26Cg4MREhICmUyGxx57DMHBwdqv0NBQzJ8/H0lJSQaFWLlyJQICAuDs7Izw8HAUFBTc8ditW7ciLCwM7u7uaNGiBYKDg7Fx40aD3peIiJo29tvYJ71Xbk6dOgUhBDp06ICCggK0bdtW+5xcLoeXlxccHaXvU27ZsgVKpRIZGRkIDw9Heno6oqOjcezYMXh5edU7vnXr1pg5cya6du0KuVyOr7/+GnFxcfDy8kJ0dLTk9ycioqaL/Tb2Se/ipn379gAAjUZj1ABpaWkYP3484uLiAAAZGRnYvn071q5di+nTp9c7vn///jqPp0yZgg0bNmD37t0sboiIyGDst7EfBt0VHKjtu2noruDDhg3Te46qqioUFhYiMTFRO+bg4ICoqCjk5d19mVAIgV27duHYsWN48803GzxGpVJBpVJpH5eX11boarUaarVa76x3UzeXMec0NVvMDNhmbmY2D2Y2D3vKrFZXa7+vrlZD7WA9F7yxp8/ZmHPrQyaEtEsXnTx5Ek8++SQOHjwImUyGupfXVbs1NTV6z3Xu3Dn4+flhz549iIiI0I4nJCQgNzcX+fn5Db6urKwMfn5+UKlUcHR0xDvvvIOxY8c2eOycOXOQkpJSb3zTpk1wcXHROysREdkfVQ2QUFD73/mLeldDwV0pq1VZWYmYmBiUlZXBzc2t0WMlr9xMmTIFgYGByM7ORmBgIAoKCnD58mW8+uqrZrsruKurK4qKinD9+nVkZ2dDqVSiQ4cO9basACAxMRFKpVL7uLy8HP7+/hg0aNBdPxwp1Go1srKyMHDgQDg5ORltXlOyxcyAbeZmZvNgZvOwl8xCCAx/Zy+AawCA6OhBcJEbvKFhdPbyORtL3c6LPiT/K+bl5WHXrl3w9PSEg4MDHBwc8PDDDyM1NRWTJ0/GgQMH9J7L09MTjo6OKC0t1RkvLS2Fj4/PHV/n4OCATp06Aag9i+vIkSNITU1tsLhRKBRQKBT1xp2cnEzyy2KqeU3JFjMDtpmbmc2Dmc3D1jNXVlXjSEltYdPd1w1uLs5W2XNj65+zMefUl+SL+NXU1MDV1RVAbXFy7tw5ALUNx8eOHZM0l1wuR2hoKLKzs7VjGo0G2dnZOttUd6PRaHT6aoiIiKRgM7F9kbxy06NHD/z8888IDAxEeHg4Fi1aBLlcjvfffx8dOnSQHECpVCI2NhZhYWHo3bs30tPTUVFRoT17asyYMfDz80NqaioAIDU1FWFhYejYsSNUKhV27NiBjRs34t1335X83kRE1HTx4n32S3JxM2vWLFRUVAAA5s6di7///e/o168f2rRpgy1btkgOMGrUKFy8eBFJSUkoKSlBcHAwMjMz4e3tDQAoLi6Gg8NfC0wVFRWYOHEi/vjjDzRv3hxdu3bFhx9+iFGjRkl+byIiapp48T77Jrm4ufVaMp06dcLRo0dx5coVeHh4GLykFx8fj/j4+Aafy8nJ0Xk8f/58zJ8/36D3ISIiAnjxPntnlLbw1q1b
G2MaIiIis2O/jf3Rq7h56qmn9J5w69atBochIiIyN9Y19kevs6VatWql/XJzc0N2djZ++ukn7fOFhYXIzs5Gq1atTBaUiIiISB96rdysW7dO+/3rr7+OkSNHIiMjQ3ujzJqaGkycONGoF8UjIiIiMoTk69ysXbsW06ZN07kDuKOjI5RKJdauXWvUcERERKYg7cZDZGskFzfV1dU4evRovfGjR48a/Y7hRERExsbTwO2f5LOl4uLi8NJLL+HEiRPo3bs3ACA/Px8LFy7UXniPiIjIWvE0cPsnubhZvHgxfHx8sGTJEpw/fx4A4Ovri9deew2vvvqq0QMSERGZCk8Dt0+SixsHBwckJCQgISFBe4dONhITEZGt4G0X7N89XcSPRQ0REdmS2n6bvZaOQSYmuaGYiIjIVrHfpmlgcUNERE0S+23sF4sbIiJqMthv0zToVdy0bt0aly5dAgCMHTsW165dM2koIiIiYxMCeG71PkvHIDPQq7ipqqrSnhm1YcMG3Lx506ShiIiIjK1KAxwpqf2Pc/bb2De9zpaKiIjAiBEjEBoaCiEEJk+ejObNmzd4LG/BQERE1o79NvZNr+Lmww8/xNKlS3HixAnIZDKUlZVx9YaIiGwW6xr7pldx4+3tjYULFwIAAgMDsXHjRrRp08akwYiIiIxFCIFlv3AbqqmQfBG/U6dOmSIHERGRydxQ1+BsZe1yDftt7J9Bp4Ln5uZi6NCh6NSpEzp16oRhw4bhhx9+MHY2IiIio2O/jf2TXNx8+OGHiIqKgouLCyZPnqxtLh4wYAA2bdpkioxERET3hNe3aVokb0u98cYbWLRoEaZOnaodmzx5MtLS0jBv3jzExMQYNSAREdG9EELw+jZNjOSVm5MnT2Lo0KH1xocNG8Z+HCIisjo31DXa69t083Flv00TILm48ff3R3Z2dr3xb7/9Fv7+/kYJRUREZAofj/sb+22aAMnbUq+++iomT56MoqIi9OnTBwDw448/Yv369Vi2bJnRAxIRERkL65qmQXJxM2HCBPj4+GDJkiX45JNPAADdunXDli1bMHz4cKMHJCIiuhe3NhNT02DQqeBPPvkkdu/ejcuXL+Py5cvYvXv3PRU2K1euREBAAJydnREeHo6CgoI7Hrtq1Sr069cPHh4e8PDwQFRUVKPHExFR0yWEwDMZeZaOQWZmUHFjTFu2bIFSqURycjL279+PoKAgREdH48KFCw0en5OTg+eeew7fffcd8vLy4O/vj0GDBuHs2bNmTk5ERNbuhroGh8/X3vjZz0WwmbiJsHhxk5aWhvHjxyMuLg7du3dHRkYGXFxc7ngDzo8++ggTJ05EcHAwunbtitWrV0Oj0TTY5ExERFRnSo8aNhM3EZJ7boypqqoKhYWFSExM1I45ODggKioKeXn6LSNWVlZCrVajdevWDT6vUqmgUqm0j8vLayt4tVoNtVp9D+l11c1lzDlNzRYzA7aZm5nNg5nNw5YyV1VV6zy2hcx1bOlzrmPKzFLmlAlhuVarc+fOwc/PD3v27EFERIR2PCEhAbm5ucjPz7/rHBMnTsQ333yDQ4cOwdnZud7zc+bMQUpKSr3xTZs2wcXF5d5+ACIislpCAG/9z1F7T6lFvauh4K6UzaqsrERMTAzKysrg5ubW6LH3tHJTVxdZaplv4cKF2Lx5M3JychosbAAgMTERSqVS+7i8vFzbp3O3D0cKtVqNrKwsDBw4EE5OTkab15RsMTNgm7mZ2TyY2TxsJXNlVTVe2bsLANDVpyXkDletPvOtbOVzvpUpM9ftvOjDoOLmgw8+wFtvvYXffvsNANClSxe89tprGD16tKR5PD094ejoiNLSUp3x0tJS+Pj4NPraxYsXY+HChfj222/Rq1evOx6nUCigUCjqjTs5OZnkl8VU85qSLWYGbDM3M5sHM5uHtWdupvnrP7w3j+uN3OydVp+5Icz815z6ktxQnJaWhgkTJmDw4MH45JNP8Mknn+Dxxx/Hyy+/
jKVLl0qaSy6XIzQ0VKcZuK45+NZtqtstWrQI8+bNQ2ZmJsLCwqT+CEREZOduPwWcfcRNi+SVm+XLl+Pdd9/FmDFjtGPDhg3Dgw8+iDlz5ujcUFMfSqUSsbGxCAsLQ+/evZGeno6KigrExcUBAMaMGQM/Pz+kpqYCAN58800kJSVh06ZNCAgIQElJCQCgZcuWaNmypdQfh4iI7NCtp4B393XjKeBNjOTi5vz589rbLtyqT58+OH/+vOQAo0aNwsWLF5GUlISSkhIEBwcjMzMT3t7eAIDi4mI4OPy1wPTuu++iqqoK//jHP3TmSU5Oxpw5cyS/PxER2Z9bT5X59OUIyGS8THFTIrm46dSpEz755BPMmDFDZ3zLli3o3LmzQSHi4+MRHx/f4HM5OTk6j0+fPm3QexARUdPALSmSXNykpKRg1KhR+P7779G3b18AtTfOzM7O1t5rioiIyFIa2pKqrq6+y6vInkhuKH766aeRn58PT09PbNu2Ddu2bYOnpycKCgrw5JNPmiIjERGR3upvSXHppqkx6FTw0NBQfPjhh8bOQkREdE+4JUWAnsVNeXm59oJ3d7uIjjEvjEdERCQFz5IiQM/ixsPDA+fPn4eXlxfc3d0bXOITQkAmk6GmpsboIYmIiKTillTTpVdxs2vXLu2NKb/77juTBiIiIjIG1jVNl17FTWRkpPb7wMBA+Pv716uGhRA4c+aMcdMRERFJYLlbQZM1kXy2VGBgIC5evFhv/MqVKwgMDDRKKCIiIqlubyampktycVPXW3O769ev3/HO3ERERKbGZmKqo/ep4EqlEgAgk8kwe/ZsuLi4aJ+rqalBfn4+goODjR6QiIhIH7y+DdXRu7g5cOAAgNqVm4MHD0Iul2ufk8vlCAoKwrRp04yfkIiI6C54fRu6ld7FTd1ZUnFxcVi2bBmvZ0NERFaDW1J0K8lXKF63bp0pchARERmMW1J0K4Nuv/DTTz/hk08+QXFxMaqqqnSe27p1q1GCERER6YNbUnQ7yWdLbd68GX369MGRI0fwxRdfQK1W49ChQ9i1axdatWplioxERER3VFnFLSnSJbm4WbBgAZYuXYqvvvoKcrkcy5Ytw9GjRzFy5Ejcf//9pshIRETUoNtXbbglRYABxc2JEycwZMgQALVnSVVUVEAmk2Hq1Kl4//33jR6QiIjoTm5vJHaRc9WGDChuPDw8cO3aNQCAn58ffvnlFwDA1atXUVlZadx0REREjWAjMTVEckPxI488gqysLPTs2RPPPPMMpkyZgl27diErKwsDBgwwRUYiIqJ62EhMdyK5uFmxYgVu3rwJAJg5cyacnJywZ88ePP3005g1a5bRAxIRETWE17ahO5Fc3LRu3Vr7vYODA6ZPn659fOPGDeOkIiIikoBbUnQryT03DVGpVEhLS+NdwYmIyGxu7bdhXUO30ru4UalUSExMRFhYGPr06YNt27YBqL1icWBgIJYuXYqpU6eaKicREZHW7f02RLfSe1sqKSkJ7733HqKiorBnzx4888wziIuLw969e5GWloZnnnkGjo7c7yQiItPjhfuoMXoXN59++ik++OADDBs2DL/88gt69eqF6upq/Pzzz9znJCIis+GF++hu9N6W+uOPPxAaGgoA6NGjBxQKBaZOncpfKCIiMqvbV2144T66nd7FTU1NDeRyufZxs2bN0LJly3sOsHLlSgQEBMDZ2Rnh4eEoKCi447GHDh3C008/jYCAAMhkMqSnp9/z+xMRke3gqg3pQ+9tKSEEXnzxRSgUCgDAzZs38fLLL6NFixY6x0m5K/iWLVugVCqRkZGB8PBwpKenIzo6GseOHYOXl1e94ysrK9GhQwc888wzbF4mImqCuGpD+tC7uImNjdV5/MILL9zzm6elpWH8+PGIi4sDAGRkZGD79u1Yu3atzvVz6vztb3/D3/72NwBo8HkiIrJfXLUhfeld3Kxbt86ob1xVVYXCwkIkJiZqxxwcHBAVFYW8POOd3qdS
qaBSqbSPy8trK361Wg21Wm2096mby5hzmpotZgZsMzczmwczm4elMleoqrWrNt18XOEk0+idgZ+zeZgys5Q5ZULcehkk8zl37hz8/PywZ88eREREaMcTEhKQm5uL/Pz8Rl8fEBCAV155Ba+88kqjx82ZMwcpKSn1xjdt2gQXFxeDshMRkXkJAbz1P0ecraxdqVnUuxoK7kg1KZWVlYiJiUFZWRnc3NwaPVby7RdsTWJiIpRKpfZxeXk5/P39MWjQoLt+OFKo1WpkZWVh4MCBcHJyMtq8pmSLmQHbzM3M5sHM5mGJzJVV1Xhl7y4Atas2I/7+f5K2pPg5m4cpM9ftvOjDYsWNp6cnHB0dUVpaqjNeWloKHx8fo72PQqHQNkHfysnJySS/LKaa15RsMTNgm7mZ2TyY2TzMmbmZ5q9C5rMJfSCXG/bni5+zeZgis5T5jHJvKUPI5XKEhoYiOztbO6bRaJCdna2zTUVERE3b7Y3E7CGmu7HotpRSqURsbCzCwsLQu3dvpKeno6KiQnv21JgxY+Dn54fU1FQAtU3Ihw8f1n5/9uxZFBUVoWXLlujUqZPFfg4iIjId3mqBpLJocTNq1ChcvHgRSUlJKCkpQXBwMDIzM+Ht7Q0AKC4uhoPDX4tL586dQ0hIiPbx4sWLsXjxYkRGRiInJ8fc8YmIyMR4+jcZwuINxfHx8YiPj2/wudsLloCAAFjo5C4iIrIAXrSPDGGxnhsiIqLGaDQCf1++W/uYqzakLxY3RERkdYSoLWxOXaoAwFUbkobFDRERWZ1bt6MCPVvg638/zFUb0huLGyIisiq3b0d9/e+H4eDAwob0x+KGiIisBrejyBhY3BARkdXgdhQZA4sbIiKyCrdf04bbUWQoFjdERGQVeE0bMhYWN0REZHG8pg0ZE4sbIiKyKI1GYEBaLpuIyWhY3BARkcXcXtiwiZiMgcUNERFZxO2nfQd6tkC2MpJNxHTPWNwQEZFF3H7aNwsbMhYWN0REZHa8CjGZEosbIiIyKzYQk6k1s3QAIiJqGoQQqKyqqddnwwZiMjYWN0REZHJCCPwjIw+Fv/+pHWOfDZkKt6WIiMikhBC4XFGlU9h093VjYUMmw5UbIiIymbrG4bqzogDgp1lRaNNCzq0oMhkWN0REZHQN9dcAQFh7DxY2ZHIsboiIyKgaWq2paxx2kTuysCGTY3FDRET3TAiBG+oaCIF6qzXdfd14HRsyKxY3RERksLrtp2cy8nRWagCu1pDlsLghIiJJhBBQ1QAVqmrEvLO3XlEDcLWGLIvFDRER3dWt207/eHcvjpQ0Q0LBLp1juvu64dOXIyCTAc2duFpDlsPihoiIdNQVMn89RoPbTnXqihpuP5G1sIriZuXKlXjrrbdQUlKCoKAgLF++HL17977j8Z9++ilmz56N06dPo3PnznjzzTcxePBgMyYmIrJttxcwf403XsjU8XMR2K4cALncias0ZHUsXtxs2bIFSqUSGRkZCA8PR3p6OqKjo3Hs2DF4eXnVO37Pnj147rnnkJqair///e/YtGkTRowYgf3796NHjx4W+AmIiEzvTsWIYXPpV8Dcrm6Fprpaje+ydqKFohmcnCz+Z4SoHov/VqalpWH8+PGIi4sDAGRkZGD79u1Yu3Ytpk+fXu/4ZcuW4fHHH8drr70GAJg3bx6ysrKwYsUKZGRkmDX7reoa7CqrquEkbOO/YNTqapvLDNhmbmY2D3vNbGgxci9u7Z+pU7dCo3YQ4EINWTOLFjdVVVUoLCxEYmKidszBwQFRUVHIy8tr8DV5eXlQKpU6Y9HR0di2bVuDx6tUKqhUKu3jsrIyAMCVK1egVqvv8Sf4S3nFTUz7oQrTfvjaaHOaiy1mBmwzNzObBzPr5wHvllgz5qEGC5XmTo64eb1MZ+zG//+/arUalZWVuHz5MpycnEwf1AiY2TxMmfnatWsAahcT7saixc2lS5dQU1MDb29vnXFvb28cPXq0wdeUlJQ0eHxJ
SUmDx6empiIlJaXeeGBgoIGpiYjswxkA7RPvehiRVbl27RpatWrV6DEW35YytcTERJ2VHo1GgytXrqBNmzZGbYArLy+Hv78/zpw5Azc3N6PNa0q2mBmwzdzMbB7MbB7MbB7MrEsIgWvXrqFdu3Z3PdaixY2npyccHR1RWlqqM15aWgofH58GX+Pj4yPpeIVCAYVCoTPm7u5ueOi7cHNzs5lfwjq2mBmwzdzMbB7MbB7MbB7M/Je7rdjUcTD6O0sgl8sRGhqK7Oxs7ZhGo0F2djYiIiIafE1ERITO8QCQlZV1x+OJiIioabH4tpRSqURsbCzCwsLQu3dvpKeno6KiQnv21JgxY+Dn54fU1FQAwJQpUxAZGYklS5ZgyJAh2Lx5M3766Se8//77lvwxiIiIyEpYvLgZNWoULl68iKSkJJSUlCA4OBiZmZnapuHi4mI4OPy1wNSnTx9s2rQJs2bNwowZM9C5c2ds27bN4te4USgUSE5OrrcFZs1sMTNgm7mZ2TyY2TyY2TyY2XAyoc85VUREREQ2wqI9N0RERETGxuKGiIiI7AqLGyIiIrIrLG6IiIjIrrC4kWDlypUICAiAs7MzwsPDUVBQ0Ojxn376Kbp27QpnZ2f07NkTO3bsMFPSv0jJfOjQITz99NMICAiATCZDenq6+YLeQkrmVatWoV+/fvDw8ICHhweioqLu+u9iKlJyb926FWFhYXB3d0eLFi0QHByMjRs3mjFtLam/03U2b94MmUyGESNGmDZgA6RkXr9+PWQymc6Xs7OzGdPWkvo5X716FZMmTYKvry8UCgW6dOli9v/9kJK5f//+9T5nmUyGIUOGmDGx9M85PT0dDzzwAJo3bw5/f39MnToVN2/eNFPaWlIyq9VqzJ07Fx07doSzszOCgoKQmZlpxrTA999/j6FDh6Jdu3aQyWR3vK/jrXJycvDQQw9BoVCgU6dOWL9+vclzQpBeNm/eLORyuVi7dq04dOiQGD9+vHB3dxelpaUNHv/jjz8KR0dHsWjRInH48GExa9Ys4eTkJA4ePGi1mQsKCsS0adPExx9/LHx8fMTSpUvNlrWO1MwxMTFi5cqV4sCBA+LIkSPixRdfFK1atRJ//PGHVef+7rvvxNatW8Xhw4fF8ePHRXp6unB0dBSZmZlWm7nOqVOnhJ+fn+jXr58YPny4ecL+f1Izr1u3Tri5uYnz589rv0pKSqw6s0qlEmFhYWLw4MFi9+7d4tSpUyInJ0cUFRVZbebLly/rfMa//PKLcHR0FOvWrbPazB999JFQKBTio48+EqdOnRLffPON8PX1FVOnTrXazAkJCaJdu3Zi+/bt4sSJE+Kdd94Rzs7OYv/+/WbLvGPHDjFz5kyxdetWAUB88cUXjR5/8uRJ4eLiIpRKpTh8+LBYvny5Wf63jsWNnnr37i0mTZqkfVxTUyPatWsnUlNTGzx+5MiRYsiQITpj4eHh4l//+pdJc95KauZbtW/f3iLFzb1kFkKI6upq4erqKjZs2GCqiA2619xCCBESEiJmzZplingNMiRzdXW16NOnj1i9erWIjY01e3EjNfO6detEq1atzJSuYVIzv/vuu6JDhw6iqqrKXBHrudff56VLlwpXV1dx/fp1U0WsR2rmSZMmiccee0xnTKlUir59+5o0562kZvb19RUrVqzQGXvqqafE888/b9Kcd6JPcZOQkCAefPBBnbFRo0aJ6OhoEyYTgttSeqiqqkJhYSGioqK0Yw4ODoiKikJeXl6Dr8nLy9M5HgCio6PveLyxGZLZ0oyRubKyEmq1Gq1btzZVzHruNbcQAtnZ2Th27BgeeeQRU0bVMjTz3Llz4eXlhZdeeskcMXUYmvn69eto3749/P39MXz4cBw6dMgccQEYlvk///kPIiIiMGnSJHh7e6NHjx5YsGABampqrDbz7dasWYNnn30WLVq0MFVMHYZk7tOnDwoLC7XbQCdPnsSOHTswePBgq82sUqnqbas2b94cu3fv
NmnWe2Gpv4UsbvRw6dIl1NTUaK+aXMfb2xslJSUNvqakpETS8cZmSGZLM0bm119/He3atav3/0ymZGjusrIytGzZEnK5HEOGDMHy5csxcOBAU8cFYFjm3bt3Y82aNVi1apU5ItZjSOYHHngAa9euxZdffokPP/wQGo0Gffr0wR9//GGOyAZlPnnyJD777DPU1NRgx44dmD17NpYsWYL58+ebI/I9//9hQUEBfvnlF4wbN85UEesxJHNMTAzmzp2Lhx9+GE5OTujYsSP69++PGTNmmCOyQZmjo6ORlpaG3377DRqNBllZWdi6dSvOnz9vjsgGudPfwvLycty4ccNk78vihuzGwoULsXnzZnzxxRcWaRqVytXVFUVFRdi3bx/eeOMNKJVK5OTkWDpWg65du4bRo0dj1apV8PT0tHQcvUVERGDMmDEIDg5GZGQktm7dirZt2+K9996zdLQ70mg08PLywvvvv4/Q0FCMGjUKM2fOREZGhqWj6WXNmjXo2bMnevfubekojcrJycGCBQvwzjvvYP/+/di6dSu2b9+OefPmWTraHS1btgydO3dG165dIZfLER8fj7i4OJ1bFFEti99byhZ4enrC0dERpaWlOuOlpaXw8fFp8DU+Pj6Sjjc2QzJb2r1kXrx4MRYuXIhvv/0WvXr1MmXMegzN7eDggE6dOgEAgoODceTIEaSmpqJ///6mjAtAeuYTJ07g9OnTGDp0qHZMo9EAAJo1a4Zjx46hY8eOVpW5IU5OTggJCcHx48dNEbEeQzL7+vrCyckJjo6O2rFu3bqhpKQEVVVVkMvlVpe5TkVFBTZv3oy5c+eaMmI9hmSePXs2Ro8erV1h6tmzJyoqKvDPf/4TM2fONHnBYEjmtm3bYtu2bbh58yYuX76Mdu3aYfr06ejQoYNJs96LO/0tdHNzQ/PmzU32viz39CCXyxEaGors7GztmEajQXZ2NiIiIhp8TUREhM7xAJCVlXXH443NkMyWZmjmRYsWYd68ecjMzERYWJg5ouow1met0WigUqlMEbEeqZm7du2KgwcPoqioSPs1bNgwPProoygqKoK/v7/VZW5ITU0NDh48CF9fX1PF1GFI5r59++L48ePa4hEAfv31V/j6+pq8sDE0c51PP/0UKpUKL7zwgqlj6jAkc2VlZb0Cpq6gFGa45eK9fM7Ozs7w8/NDdXU1Pv/8cwwfPtzUcQ1msb+FJm1XtiObN28WCoVCrF+/Xhw+fFj885//FO7u7trTSkePHi2mT5+uPf7HH38UzZo1E4sXLxZHjhwRycnJFjkVXEpmlUolDhw4IA4cOCB8fX3FtGnTxIEDB8Rvv/1mtZkXLlwo5HK5+Oyzz3RORb127ZrZMhuSe8GCBWLnzp3ixIkT4vDhw2Lx4sWiWbNmYtWqVVab+XaWOFtKauaUlBTxzTffiBMnTojCwkLx7LPPCmdnZ3Ho0CGrzVxcXCxcXV1FfHy8OHbsmPj666+Fl5eXmD9/vtVmrvPwww+LUaNGmS3nraRmTk5OFq6uruLjjz8WJ0+eFDt37hQdO3YUI0eOtNrMe/fuFZ9//rk4ceKE+P7778Vjjz0mAgMDxZ9//mm2zNeuXdP+nQAg0tLSxIEDB8Tvv/8uhBBi+vTpYvTo0drj604Ff+2118SRI0fEypUreSq4tVm+fLm4//77hVwuF7179xZ79+7VPhcZGSliY2N1jv/kk09Ely5dhFwuFw8++KDYvn27mRNLy3zq1CkBoN5XZGSk1WZu3759g5mTk5PNmllq7pkzZ4pOnToJZ2dn4eHhISIiIsTmzZutOvPtLFHcCCEt8yuvvKI91tvbWwwePNis1wQxJLMQQuzZs0eEh4cLhUIhOnToIN544w1RXV1t1ZmPHj0qAIidO3eaNeetpGRWq9Vizpw5omPHjsLZ2Vn4+/uLiRMnmrVQkJo5JydHdOvWTSgUCtGmTRsxevRocfbsWbPm/e677xr839y6nLGxsfX+Znz33Xci
ODhYyOVy0aFDB7Nc/0gmhBnW34iIiIjMhD03REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERnV+vXr4e7ubukYOH36NGQyGYqKiu5pnv79++OVV17RPg4ICEB6evo9zQkAL774IkaMGHHP8xBRfSxuiJqYkpIS/Pvf/0aHDh2gUCjg7++PoUOH1rv/i6FGjRqFX3/91ShzNebUqVOIiYlBu3bt4OzsjPvuuw/Dhw/H0aNHAQD+/v44f/48evTocU/vs3XrVpPcKXrZsmVYv3699vHtRRQRGY53BSdqQk6fPo2+ffvC3d0db731Fnr27Am1Wo1vvvkGkyZN0hYG96J58+YmvdsvAKjVagwcOBAPPPAAtm7dCl9fX/zxxx/473//i6tXrwKovQmivncLb0zr1q3veY5b1dTUQCaToVWrVkadl4huYfIbPBCR1XjiiSeEn5+fuH79er3nbr2nzu+//y6GDRsmWrRoIVxdXcUzzzyjvZmfEEIUFRWJ/v37i5YtWwpXV1fx0EMPiX379gkhhFi3bp1o1aqV9tjk5GQRFBQkPvjgA9G+fXvh5uYmRo0aJcrLy7XH1NTUiAULFoiAgADh7OwsevXqJT799NM7/hx1N+07ffr0HY+pu1fagQMHhBB/3RMnMzNTBAcHC2dnZ/Hoo4+K0tJSsWPHDtG1a1fh6uoqnnvuOVFRUaGdJzIyUkyZMkX7uH379mLp0qXax0uWLBE9evQQLi4u4r777hMTJkzQuXFr3efx5Zdfim7duglHR0dx6tQpnXtzxcbG1rtXz8mTJ0XHjh3FW2+91eDPbs4b2hLZGm5LETURV65cQWZmJiZNmoQWLVrUe76uT0aj0WD48OG4cuUKcnNzkZWVhZMnT2LUqFHaY59//nncd9992LdvHwoLCzF9+nQ4OTnd8b1PnDiBbdu24euvv8bXX3+N3NxcLFy4UPt8amoqPvjgA2RkZODQoUOYOnUqXnjhBeTm5jY4X9u2beHg4IDPPvsMNTU1kj6HOXPmYMWKFdizZw/OnDmDkSNHIj09HZs2bcL27duxc+dOLF++XO/5HBwc8Pbbb+PQoUPYsGEDdu3ahYSEBJ1jKisr8eabb2L16tU4dOgQvLy8dJ5ftmwZIiIiMH78eJw/fx7nz5/H/fffj7Fjx2LdunU6x65btw6PPPIIOnXqJOnnJmpSLF1dEZF55OfnCwBi69atjR63c+dO4ejoKIqLi7Vjhw4dEgBEQUGBEEIIV1dXsX79+gZf39DKjYuLi85KzWuvvSbCw8OFEELcvHlTuLi4iD179ujM89JLL4nnnnvujjlXrFghXFxchKurq3j00UfF3LlzxYkTJ7TP32nl5ttvv9Uek5qaKgDovO5f//qXiI6O1j6+28rN7T799FPRpk0bnc8DgCgqKtI57va7qt/+PkIIcfbsWeHo6Cjy8/OFEEJUVVUJT0/PO372RFSLKzdETYQQQq/jjhw5An9/f/j7+2vHunfvDnd3dxw5cgQAoFQqMW7cOERFRWHhwoU4ceJEo3MGBATA1dVV+9jX1xcXLlwAABw/fhyVlZUYOHAgWrZsqf364IMPGp130qRJKCkpwUcffYSIiAh8+umnePDBB5GVldVoll69emm/9/b2houLCzp06KAzVpdNH99++y0GDBgAPz8/uLq6YvTo0bh8+TIqKyu1x8jlcp331Ve7du0wZMgQrF27FgDw1VdfQaVS4ZlnnpE8F1FTwuKGqIno3LkzZDKZUZqG58yZg0OHDmHIkCHYtWsXunfvji+++OKOx9++ZSWTyaDRaAAA169fBwBs374dRUVF2q/Dhw/js88+azSHq6srhg4dijfeeAM///wz+vXrh/nz5zf6mluzyGSyRrPdzenTp/H3v/8dvXr1wueff47CwkKsXLkSAFBVVaU9rnnz5pDJZHrNebtx48Zh8+bNuHHjBtatW4dRo0bBxcXF
oLmImgoWN0RNROvWrREdHY2VK1eioqKi3vN1Zxl169YNZ86cwZkzZ7TPHT58GFevXkX37t21Y126dMHUqVOxc+dOPPXUU/V6Q/TVvXt3KBQKFBcXo1OnTjpft64e3Y1MJkPXrl0b/NlMpbCwEBqNBkuWLMH//d//oUuXLjh37pxBc8nl8gb7hwYPHowWLVrg3XffRWZmJsaOHXuvsYnsHosboiZk5cqVqKmpQe/evfH555/jt99+w5EjR/D2228jIiICABAVFYWePXvi+eefx/79+1FQUIAxY8YgMjISYWFhuHHjBuLj45GTk4Pff/8dP/74I/bt24du3boZlMnV1RXTpk3D1KlTsWHDBpw4cQL79+/H8uXLsWHDhgZfU1RUhOHDh+Ozzz7D4cOHcfz4caxZswZr167F8OHDDf58pOrUqRPUajWWL1+OkydPYuPGjcjIyDBoroCAAOTn5+P06dO4dOmSdvXI0dERL774IhITE9G5c2ftvxMR3RmLG6ImpEOHDti/fz8effRRvPrqq+jRowcGDhyI7OxsvPvuuwBqV0C+/PJLeHh44JFHHkFUVBQ6dOiALVu2AKj9Y3v58mWMGTMGXbp0wciRI/HEE08gJSXF4Fzz5s3D7NmzkZqaim7duuHxxx/H9u3bERgY2ODx9913HwICApCSkoLw8HA89NBDWLZsGVJSUjBz5kyDc0gVFBSEtLQ0vPnmm+jRowc++ugjpKamGjTXtGnT4OjoiO7du6Nt27YoLi7WPvfSSy+hqqoKcXFxxopOZNdkQt8uQyIisogffvgBAwYMwJkzZ+Dt7W3pOERWj8UNEZGVUqlUuHjxImJjY+Hj44OPPvrI0pGIbAK3pYiIrNTHH3+M9u3b4+rVq1i0aJGl4xDZDK7cEBERkV3hyg0RERHZFRY3REREZFdY3BAREZFdYXFDREREdoXFDREREdkVFjdERERkV1jcEBERkV1hcUNERER2hcUNERER2ZX/B7e2w9rkroZEAAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.ecdf(x=similarity_across_dataset.keys(), weights=similarity_across_dataset.values())\n",
    "plt.xticks(np.linspace(0, 1, 11))\n",
    "plt.yticks(np.linspace(0, 1, 11))\n",
    "plt.xlabel(\"Cosine Similarity\")\n",
    "plt.ylabel(\"Ratio of dataset below the similarity score\")\n",
    "plt.grid()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd43865d",
   "metadata": {},
   "source": [
    "#### Looking at Duplicates\n",
    "\n",
    "- `id` : This is a list of all IDs that are above our similarity threshold `eps`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "2578b143",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>00031354-7148-4513-af11-4c93c73db7ba</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0003de98-af44-4181-8ff5-9c8d86be8b96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0004215a-4560-4f23-9fbb-7e63b7a78be9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>00065e09-b164-48ac-80fc-e4fd7a760c47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>000d15ff-1f2e-4f21-974f-e2e9d4d96578</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id\n",
       "0  00031354-7148-4513-af11-4c93c73db7ba\n",
       "1  0003de98-af44-4181-8ff5-9c8d86be8b96\n",
       "2  0004215a-4560-4f23-9fbb-7e63b7a78be9\n",
       "3  00065e09-b164-48ac-80fc-e4fd7a760c47\n",
       "4  000d15ff-1f2e-4f21-974f-e2e9d4d96578"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "duplicates_path = os.path.join(output_path, \"duplicates\")\n",
    "\n",
    "pd.read_parquet(os.path.join(duplicates_path, os.listdir(duplicates_path)[0])).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "018bf976",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We found 320,467 duplicates in the dataset\n"
     ]
    }
   ],
   "source": [
    "# Sum row counts from the Parquet metadata, which avoids loading the data itself\n",
    "num_duplicates = sum(\n",
    "    pq.read_metadata(os.path.join(duplicates_path, f)).num_rows for f in os.listdir(duplicates_path)\n",
    ")\n",
    "print(f\"We found {num_duplicates:,} duplicates in the dataset\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98c21b65",
   "metadata": {},
   "source": [
    "#### Looking at the Deduplicated Dataset\n",
    "\n",
    "Here we see that the deduplicated dataset keeps all of the original columns.\n",
    "\n",
    "We can control the output schema by specifying the `output_fields` argument in the workflow definition.\n",
    "\n",
    "If you had set `use_id_generator=True`, you would see the `_curator_dedup_id` column here as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ee288d0c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Once upon a time, there was a big dog named Ma...</td>\n",
       "      <td>ad9623c1-e820-417b-a894-8ac327a63781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Once upon a time, there was a shy little girl ...</td>\n",
       "      <td>d2857534-6995-4c30-b06f-bb842f683601</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Once upon a time, there was a little boy named...</td>\n",
       "      <td>0fe524d8-ca21-4c04-b344-9af8015f52e1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Once upon a time, there was a little girl name...</td>\n",
       "      <td>87222f54-fc70-4f93-825f-5f0e59c4e070</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Once upon a time, there was a little girl name...</td>\n",
       "      <td>1d239f39-f5da-4573-bb52-9bb37e928a13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  Once upon a time, there was a big dog named Ma...   \n",
       "1  Once upon a time, there was a shy little girl ...   \n",
       "2  Once upon a time, there was a little boy named...   \n",
       "3  Once upon a time, there was a little girl name...   \n",
       "4  Once upon a time, there was a little girl name...   \n",
       "\n",
       "                                     id  \n",
       "0  ad9623c1-e820-417b-a894-8ac327a63781  \n",
       "1  d2857534-6995-4c30-b06f-bb842f683601  \n",
       "2  0fe524d8-ca21-4c04-b344-9af8015f52e1  \n",
       "3  87222f54-fc70-4f93-825f-5f0e59c4e070  \n",
       "4  1d239f39-f5da-4573-bb52-9bb37e928a13  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deduplicated_path = os.path.join(output_path, \"deduplicated\")\n",
    "\n",
    "pd.read_parquet(os.path.join(deduplicated_path, os.listdir(deduplicated_path)[0])).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "66cfca8f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Our final dataset has 1,799,252 rows\n"
     ]
    }
   ],
   "source": [
    "num_deduplicated = sum(\n",
    "    pq.read_metadata(os.path.join(deduplicated_path, f)).num_rows for f in os.listdir(deduplicated_path)\n",
    ")\n",
    "print(f\"Our final dataset has {num_deduplicated:,} rows\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92db9f62",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "Using a single workflow, we removed 320,467 duplicates (about 15%) from our dataset of roughly 2.1 million rows, leaving 1,799,252 rows."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
